Abstract


It is intuitively appealing to try to combine eye-tracking data and verbal reports when
investigating medical image interpretation. However, before collecting such data, 
important decisions must be made, including exactly when and how to collect the verbal 
reports. The purpose of this methodological article is to reflect on the pros and cons of 
different solutions and to offer some guidelines to investigators. We start by exploring the 
ontology of vision and speech production and the epistemology of eye movements to 
grasp what fixations and verbal reports actually reflect. We are also interested in the 
major constraints of the two systems. Second, we elaborate on two dominant 
investigational approaches to verbal accounts: concurrent think-aloud and Chi’s 
explanations. Later, we move on to other approaches. Third, we present and critically 
evaluate studies from the literature on medical image interpretation, specifically ones 
that have sought to contrast or integrate eye-movement data and verbal reports. Fourth,
we conclude with some practical guidelines and suggestions for further research.


1. Introduction

The study of medical expertise in visual domains, such as radiology and dermatology, is firmly 
rooted in two distinct investigational approaches, both of which serve certain purposes: (a) the study of 
visual search or perception using eye-tracking methods (e.g., Berbaum et al., 1998; Kundel, Nodine, &
Carmody, 1978; Krupinski et al., 2006; Rubin et al., 2014) and (b) the study of clinical reasoning, usually 
employing verbal reports (Azevedo, Faremo, & Lajoie, 2007; Lesgold, Feltovich, Glaser, & Wang, 1981; 
Morita et al., 2008; van der Gijp et al., 2015). Before the advent of commercial eye trackers, verbal reports 
were basically the only way to gain insight into diagnostic reasoning. Even today, some type of verbal report 
is needed because one cannot deduce from, for example, dwell times, whether a viewer actually “sees” a 
lesion (Berbaum, Franken, Dorfman, Caldwell, & Krupinski, 2000). A crucial part of the perceptual process 
is assigning meaning to what one sees (Nodine & Kundel, 1987). As for the value of eye tracking, Krupinski
(2006) argued that it may be useful for developing individual eye movement profiles and for
understanding the difference in performance between novices and experts. In addition, it can be useful for
developing new visual search strategies.

Although studies following both lines of investigation have yielded important insights, one can
question whether either approach alone is sufficient to answer important research questions.
It is hard to see how medical image perception investigators can meet the expectations of modeling, for
instance, search strategies by relying on eye movement metrics alone. It is also hard to see how process models
can be justified based on only one source of data. As for the protocol analysts, it is odd that, for example, van 
der Gijp et al. (2014) conceptualized the interpretation of radiological images as a process of perception, 
analysis, and synthesis but methodologically relied on concurrent think-aloud techniques without the use of 
eye tracking. 

In fact, it has been argued in the context of occupational psychology that complex cognitive work 
tasks should be studied by integrating various sources of information, including eye movement data, when 
appropriate, with verbal reports (Patrick & James, 2004; Gegenfurtner et al., 2017). Patrick and James (2004) 
stressed that process tracing involves four stages, and important decisions have to be made in each stage. The 
stages are the following: (1) collection of data, (2) transcription, integration, and segmentation of the data 
into a time-lined account, (3) coding, and (4) further analysis of the data from Stage 3 and representation of 
the data. In the data collection stage, one of the most critical decisions involves the timing of data collection, 
because verbal accounts can be collected concurrently with task performance or retrospectively. As for the 
transcription phase of verbal reports, the authors present the integrated actions of a person in a single table. 
Alternatively, one could think of either a data matrix containing a time-lined account of actions that is 
obtained through eye-tracking software or a set of time-stamped, transcribed videos. Stage 3 involves coding 
of the transcribed data either based on theoretical categories or done in a bottom-up fashion. The authors 
stressed that when categories are derived from a bottom-up approach, independent raters should refine 
categories iteratively, with some form of reliability being reported. In Stage 4, the analyst filters or expands 
the data using the newly acquired codes from Stage 3, and subjects the data to further analysis, whereby 
certain aspects of cognition are made more salient. The authors stress two points: (a) a minimum level of 
further analysis is whether a worker’s response or solution is correct; (b) there is a need to capture and 
represent at a global level a person’s reasoning during a scenario in relation to changes in the task and work 
situation. 
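
As a rough illustration of Stage 2, the following sketch merges fixations and transcribed utterances into a single time-lined account by pairing each utterance with the areas of interest (AOIs) fixated while it was spoken. The record layouts, field names, AOI labels, and example values are hypothetical; they are not taken from Patrick and James (2004) or from any eye-tracking software.

```python
from dataclasses import dataclass


@dataclass
class Fixation:
    start_ms: int
    end_ms: int
    aoi: str  # area of interest hit by the fixation (hypothetical labels)


@dataclass
class Utterance:
    start_ms: int
    end_ms: int
    text: str


def build_timeline(fixations, utterances):
    """Pair each utterance with the AOIs fixated while it was spoken."""
    timeline = []
    ordered_fixations = sorted(fixations, key=lambda f: f.start_ms)
    for utt in sorted(utterances, key=lambda u: u.start_ms):
        # A fixation overlaps an utterance if the two time intervals intersect.
        aois = [
            fix.aoi
            for fix in ordered_fixations
            if fix.start_ms < utt.end_ms and fix.end_ms > utt.start_ms
        ]
        timeline.append({"start_ms": utt.start_ms, "text": utt.text, "aois": aois})
    return timeline


# Hypothetical data for one short viewing episode.
fixations = [
    Fixation(0, 300, "left lung"),
    Fixation(350, 700, "heart"),
    Fixation(750, 1400, "right lung"),
]
utterances = [
    Utterance(200, 600, "left side looks clear"),
    Utterance(800, 1300, "something in the right lung"),
]

timeline = build_timeline(fixations, utterances)
for row in timeline:
    print(row["start_ms"], row["text"], row["aois"])
```

Simple interval overlap is only one possible pairing rule. Because speech tends to lag behind what is being looked at, an investigator might prefer to shift utterance onsets by some offset before pairing; the size of such an offset would have to be justified for the task at hand.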

The purpose of this methodological article is to reflect on the pros and cons of different solutions and 
to offer some guidelines for investigators. We stress that this endeavor stretches the frontiers of the field
because Gegenfurtner, Siewiorek, Lehtinen, and Saljo (2013) reported in their systematic review that
combining eye tracking and verbal reports remains unexplored. This article is not, however, a literature review. For
review articles other than the one by Gegenfurtner et al. (2017), see Al-Moteri, Symmons, Plummer, and
Cooper (2017); Blondon, Wipfli, and Lovis (2015); and van der Gijp et al. (2016). We start by exploring the
ontology of vision and speech production and the epistemology of eye movements to grasp the major 
constraints involved in visual processing and speech production. Second, we elaborate on two dominant 
investigational approaches to verbal accounts and introduce alternative approaches. Third, we present and
evaluate studies from the literature on medical image interpretation that have sought to contrast or integrate 
eye movement data and verbal reports. Fourth, we conclude with practical solutions and some suggestions 
for further research. 


2. Nature of the visual system: What do fixations actually reflect? 

The visual system is the part of the nervous system that allows organisms to see. It interprets 
information from the environment to build a representation of the surrounding world. The visual system has 
the complex task of reconstructing a three-dimensional world from a two-dimensional retinal representation 
of that world. The performance of the visual system in a constantly changing visual environment is
remarkable. The price, however, is that approximately one-third of the cortex is needed to process visual 
information (Vanni, 2004). 

How, then, does the visual system operate? Information from the eyes flows into the brain through
the optic nerve. Information from the right visual field travels to the left optic tract. Information from the left
visual field travels to the right optic tract. Each optic tract terminates in the lateral geniculate nucleus (LGN)
in the thalamus. The region that receives information directly from the LGN is called V1, the primary visual
cortex. The macaque monkey has over 30 visual cortical
regions, and it is estimated that humans have approximately as many (Vanni, 2004). These areas are
connected to each other by an intricate wiring containing both feedforward and feedback connections 
(Vanni, 2004). According to the ventral-dorsal model introduced by Goodale and Milner (1992), information 
flows in two directions from the primary visual cortex: (a) to the posterior parietal cortex through the dorsal 
stream and (b) to the inferotemporal cortex through the ventral stream. The dorsal pathway has been 
characterized as the action stream, a pathway concerned with converting visual inputs into motor outputs, 
whereas the ventral pathway provides a visual perception of objects and events in the world. Goodale (1998)
stressed, however, that even a simple action such as picking up a cup of coffee requires activity in both
pathways. 

There are several bottlenecks in the visual system that stem from constraints in anatomy, attention, 
and working memory. First, high visual acuity is limited to the fovea, a spot on the retina. The fovea is 
employed for accurate vision in the direction where it is pointed. Visual acuity decreases dramatically in the 
parafoveal area and periphery. In eye-movement research, it is possible to capture the target of foveal 
inspection through fixations. Second, object recognition is capacity-limited and often attention-demanding
because one cannot simultaneously recognize multiple objects that have more than one feature (such as a letter T
containing green and purple). Object recognition requires more than 100 ms per item, which refers to
processing time rather than presentation time (Wolfe, Vo, Evans, & Greene, 2011). Third, there is a limit to
focusing and shifting one’s attention: people tend to move their eyes two to four times per second
when reading and conducting most visual search tasks (Salthouse & Ellis, 1980). The gaze, however, can be
trained to make the most of these few fixations: the novice’s gaze is often drawn by salient, bottom-up
features, whereas experts more often focus on top-down, task-relevant features, as evidenced by Bertram,
Helle, Kaakinen, and Svedstrom (2013; see also Wolfe, Evans, Drew, Aizenman, & Josephs, 2015).
Fourth, although information flows into the system incessantly, working-memory capacity is limited to
approximately four “chunks,” or combinations of items, at a time (Cowan, 2010). In addition, information in
one’s working memory is lost quickly: according to Ericsson (2006), it is for tasks with response latencies of
5-10 seconds that people are able to recall their sequences of thoughts quite accurately.

For the most part, the human brain processes low-level information patterns in the environment
automatically (Vanni & Heikkinen, 2015). Studies adhering to a flash-view paradigm (i.e., presenting images 
to participants for 20-250 ms) have shown that people can partially infer a scene without even fixating on 
the scene (e.g., Kirschner & Thorpe, 2006). Only a small part of the information reaching the cortex is 
processed further, with storage capacity representing yet another filter. Thus, visual information processing
reaching awareness is only the tip of the iceberg. 

People use fixations to purposively sample information from their surroundings to reconstruct a
representation of the surrounding world. However, as noted above, studies adhering to a flash-view paradigm
have shown that people can partially infer a scene without even fixating on the scene. Thus, there appear to be
two visual pathways, which Wolfe et al. (2011) termed the selective pathway, involving purposive sampling, and
the nonselective pathway. To answer the question in the section title, the sequence of fixations can be seen as
reflecting a visual search through the selective pathway (i.e., attentional guidance).


3. Speech production and verbal reports: What do verbal reports actually reflect? 

3.1 Speech production 

How and why people produce speech is usually taken for granted. Speech has many
social and cultural functions, such as signifying group identity, social grooming, settling disputes, teaching, 
and entertainment. Naturally, the function of each act of speech shapes speech production, which is a rather 
complex process. According to Levelt (1989, pp. 4-14), speech production involves four stages (originally 
conceptualized as “processing components”) that depend heavily on “knowledge stores”: (a) 
conceptualization (i.e., preverbal message generation relying on situational knowledge and content 
knowledge); (b) formulation, including grammatical and phonological encoding relying on lexical 
knowledge; (c) articulation (i.e., execution of the phonetic plan by three sets of muscles involving up to 100 
different muscles) resulting in overt speech; and (d) self-monitoring (i.e., the components of normal
language comprehension relying on lexical knowledge). Interestingly, the model includes the notion of inner
speech, which is the product of the second stage. The model does not include writing as an alternative to
articulation, but speech can be encoded into the visual or tactile form in addition to the auditory form. As a
result, people manage to produce two to three words per second as a part of fluent conversation, and overtly 
naming a clear picture of an object can be initiated within 600 ms after the appearance of the picture (Levelt, 
Roelofs, & Meyer, 1999). In fact, the generation of inner speech may be somewhat ahead of articulation. To 
cope with asynchrony, it is necessary for the phonetic plan to be stored. The storage mechanism is referred to 
as the articulatory buffer. It is important to note that these actions tax the speaker’s information-processing 
capacity, including working memory; in addition, speech production is delayed compared to recognition by 
the visual system. (Recall that object recognition requires more than 100 ms of processing time.)

3.2 Two dominant approaches to verbal reports 

In a research context, verbal reports are heavily shaped by the context in which they are produced, 
and verbal reports serve various functions in different research traditions. This can be highlighted by 
comparing two dominant approaches to verbal reports: Ericsson’s protocol analysis and Chi’s explanations. 
Ericsson and Simon’s “Verbal Reports as Data” (1980), with over 13,800 Google Scholar citations and 1,619 Web of
Science citations, appears to be the most influential piece of work on verbal reports. Based on Google
Scholar, Ericsson has been the most active author on verbal reports over the last 30 years. Second to Ericsson
and Simon’s article is an article by Micheline Chi: “Quantifying Qualitative Analysis of Verbal Data: A
Practical Guide.” This paper has over 1,490 Google Scholar citations and over 480 Web of Science citations. These
two approaches are also frequently used in the context of medical image interpretation. Therefore, Ericsson’s
protocol analysis and Chi’s explanations deserve sections of their own.

According to Ericsson (2006, p. 227), the central assumption of protocol analysis is that it is
possible to instruct people “to verbalize their thoughts in a manner that does not alter the sequence and 
content of thoughts mediating the completion of a task and therefore should reflect immediately available 
information during thinking.” Using Levelt’s terminology, Ericsson is after “inner speech.” In other words,
the purpose is to elicit concurrent, nonreactive reports of thinking to understand expert reasoning and 
performance. According to the expert performance approach, the best way to obtain valid and complete 
traces of expert thought is to strive to produce laboratory conditions that capture “the essence of expertise,” 
where participants perform tasks that are representative of the studied phenomenon and where verbalizations 
directly reflect the participants’ spontaneous thoughts that are generated while completing the task. The 
instructions can be as follows (Ericsson & Simon, 1993, p. 376): 

“In this experiment, we are interested in what you say to yourself as you perform some 
tasks that we give you. In order to do this we will ask you to TALK ALOUD as you work 
on the problems. What I mean by talk aloud is that I want you to say out loud everything 
that you say to yourself silently. Just act as if you are alone in the room speaking to 
yourself. If you are silent for any length of time I will remind you to keep talking aloud.” 

In contrast, the goal of Chi’s explanations (1997) is to figure out what a learner knows based on
what a learner says or does and how that knowledge influences the learner’s reasoning. Chi avoided giving
detailed instructions on how verbal reports should be elicited. Instead, she gave detailed instructions on the 
analysis of such reports. She stressed that one must first determine “what” the learner said (e.g., a set of
propositions or concepts). After that, to determine the overall structure of knowledge
representations, one must assess the relations among the elements of the set. For example, a learner with naive conceptions
can hold pieces of unrelated knowledge, or a learner’s knowledge set can be theory-like, meaning that the 
reasoning can be captured by a few principles. 

According to Chi (1997), the method of coding and analysing verbal data consists of the following 
eight steps: 

1. Reducing or sampling the protocols 

2. Segmenting the reduced or sampled protocols (sometimes optional) 

3. Developing a coding scheme or formalism 

4. Operationalizing evidence in the coded protocols 

5. Depicting the mapped formalism (optional) 

6. Seeking a pattern in the mapped formalism 

7. Interpreting the patterns 

8. Repeating the entire process, perhaps adopting a different grain size (optional). 

According to Chi (1997), there are five key differences between Ericsson’s protocol analysis and her 
verbal analysis. First, there is a clear contrast in the way the verbal reports are collected. Ericsson and
Simon (1993) underlined that research participants are simply verbalizing the information they attend to 
while generating an answer to a problem instead of describing, explaining, justifying, or rationalizing their 
actions. Second, there is a difference in focus. Ericsson and Simon (1994) were concerned with tapping the 
online process of problem solving or decision making, whereas Chi was interested in capturing the 
participants’ knowledge representations. She even argued that the goal of protocol analysis is to test the a 
priori model rather than to uncover what the participants are actually doing. The third difference has to do 
with analytical procedures and workloads. According to Chi (1997), in protocol analysis, coming up with the 
ideal template, which requires a cognitive task analysis, represents the majority of the workload. In contrast, 
in verbal analysis, the referents are unknown; in self-explanation data, one must determine what the 
participant is talking about (e.g., an inference, plan, or inquiry). Fourth, the method of validation or testing is 
different for the two methods. In Ericsson’s protocol analysis, the sequence of verbal utterances is simply 
compared to the ideal template. The validation of the protocol analysis is “the degree of match” between 
these two. In the verbal analysis method, validation is achieved by using statistical testing. For example, 
qualitatively different knowledge representations of different groups of participants can be checked against 
the answers to some subject-specific questions. Ericsson (2006) pointed out that task analysis can be applied
to the analysis of think-aloud protocols. However, he added that it is also possible to examine the convergent 
validity established by different types of data, including reaction times, error rates, patterns of brain 
activation, and sequences of eye movements. 

Ericsson’s protocol analysis has several advantages. An obvious advantage is that Ericsson provided 
detailed instructions on how to collect data using the method. Another advantage is that Ericsson and
Simon (1993) provided a wealth of evidence indicating that the method is not reactive (i.e., it does not alter
the course of cognitive processing). The main disadvantages are the following: (a) as Morita et al. (2008)
noted, medical image interpretation involves an implicit process that is difficult to verbalize; (b) thinking
aloud generally slows down performance, which may disrupt the execution of dynamic tasks in particular; and
(c) in certain tasks, thinking aloud has been shown to alter accuracy (Russo, Johnson, & Stephens, 1989).

The disadvantage of prompting for explanations in the middle of an activity is that it has been 
repeatedly shown to affect behaviour in multiple ways (Ericsson & Simon, 1993). A more recent study 
exploring visual search behaviour on different sets of Web pages showed that prompting for explanations not 
only prolonged the task, but also led to more general distributed visual behaviour and the issuing of more 
commands to navigate within and between the Web pages. In addition, mental workload increased (Hertzum, 
Hansen, & Anderson, 2009). Another disadvantage of Chi’s explanations is the lack of clear instructions on 
how to collect “explanations.” If explanations are required, some form of retrospective reporting should be 
seriously considered. 


3.3 Retrospective reporting 

In fact, people can provide quite accurate retrospective reports for short tasks that take 5-10 seconds 
(Ericsson, 2006). The instruction can be as follows: “Can you please tell me what you were thinking during 
problem solving?” (van Gog, Paas, van Merrienboer, & Witte, 2005). In the context of medical image 
interpretation, it is also common to ask the participants to report on the findings and final diagnosis either 
orally or in writing. As Patrick and James (2004) pointed out, ideally, verbal reports should be collected 
immediately after task completion while the participant’s short-term memory still holds relevant information. 
According to the authors, when there is a need to rely on the participant’s long-term memory, some type of 
retrieval cues should be designed. In the case of medical images, which require more than 5-10 seconds to 
interpret, showing the participants a dynamic presentation of their eye movements (and keyboard movements 
when applicable) would seem to be a viable cuing solution. 

There have been some noteworthy efforts to compare concurrent think-aloud with retrospective 
reports and cued reporting. In the context of troubleshooting electric circuits, van Gog, Paas, van 
Merrienboer and Witte (2005) conjectured that the methods would extract different types of information 
regarding process tracing. The authors did not report the duration of the troubleshooting tasks, but it seems 
safe to assume that the task durations exceeded the critical limit of 5-10 seconds. Thus, it was not surprising 
that the concurrent think-aloud and cued retrospective reporting involving showing the participants their eye 
movements and keyboard strokes resulted in more information than retrospective reporting. The remarkable
finding was that the cued retrospective method resulted in fewer theoretical meaning-making verbalizations
(why utterances), whereas the concurrent method resulted in fewer metacognitive utterances. Thus, in addition
to the context and population, one needs to consider carefully the type of information one is seeking.

The advantage of retrospective reporting is that the method of verbalization does not interfere with
task completion. If the task is of a very short duration (under 10 seconds), accurate reports of thinking
processes can be expected. Cued retrospective reporting likewise does not interfere with
task performance. However, the cues may not be sufficient to recover all task-related
information.

It is worth noting that Morita et al. (2008) showed that it can be worthwhile to triangulate verbal
reports obtained through different methods. Their results indicated that experts use more conceptual words
when thinking aloud during a visual task, whereas in writing the report they use more perceptual words than
novices do. The interpretation of the finding was that the development of expertise is based on an
ability to build connections between percepts and concepts.


4. Critical examination of studies combining eye tracking and verbal reports in the context
of medical image perception 

Although there are many arguments for combining eye tracking and verbal
reports, doing so is easier said than done, as can be seen from the following
studies employing concurrent think-aloud. Concurrent think-aloud attempts to capture nonreactive verbal
reports of thinking (Ericsson, 2006). The notion of “nonreactivity” means that the execution of the primary 
task is not affected, except for the fact that it may be prolonged. The participants are asked to perform a task 
while uttering briefly what spontaneously comes to mind. In other words, it aims to “vocalize inner speech.” 
Ericsson emphasized numerous times that participants should be talking to themselves, not explaining what
they are doing or why, because it has been repeatedly shown that the act of explaining can seriously interfere
with the task the investigators are trying to model.

The first efforts to triangulate different sources of data obtained from different studies date back to 
the year 2000. Berbaum et al. (2000) were interested in conducting a congenially designed laboratory
experiment to determine, by two different methods (eye tracking versus protocol analysis), whether
satisfaction of search (SOS) is due to recognition error or decision error. The design involved inserting artificial
lesions into an image to see whether this decreases the detection of native lesions, which would indicate SOS.
An earlier study employing eye tracking had indicated that inserting artificial lesions into certain
images decreased the reporting of native lesions on those images. In the new experiment, Berbaum et al.
(2000) discovered two important things. First, the think-aloud condition served to eliminate the SOS
effect. Second, the two methods provided contradictory results: the eye-tracking study
suggested that SOS was due to decision error, whereas the think-aloud study suggested that SOS was
due to recognition error. The authors concluded that protocol analysis is limited in its ability to
differentiate between search error and recognition error. On the other hand, there are perils in assuming that
a lesion has been recognized based on dwell time alone. Thus, it was hard to reconcile the contradictory
findings of the two studies. Also, the fact that the think-aloud procedure affected performance
on the primary task casts doubt on the integrity of the entire study: it is difficult to argue that the think-aloud
procedure was nonreactive. We speculate that the reactivity was due to the instructions given: the
observers were instructed to use a finger to point to where they were looking and to verbalize the
structures and the features they were looking for. These early efforts highlight the difficulties investigators
can experience in applying concurrent think-aloud and method triangulation. 
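
The tension between the two data sources can be made concrete with a small sketch. It labels each lesion by combining cumulative dwell time with whether the lesion was mentioned in the verbal report, following a dwell-time convention used in eye-tracking studies of missed lesions (not fixated, fixated briefly, fixated at length). The 1,000 ms cut-off, the data layout, the function names, and all values are assumptions made for illustration; they are not taken from Berbaum et al. (2000).

```python
def cumulative_dwell_ms(fixations, aoi):
    """Sum the durations of all fixations that landed on the given AOI."""
    return sum(end - start for start, end, name in fixations if name == aoi)


def classify_lesion(dwell_ms, reported, threshold_ms=1000):
    """Label one lesion from gaze and report data (threshold is an assumption)."""
    if reported:
        return "reported"
    if dwell_ms == 0:
        return "search error"        # never fixated
    if dwell_ms < threshold_ms:
        return "recognition error"   # fixated only briefly
    return "decision error"          # fixated at length but not reported


# Hypothetical fixations: (start_ms, end_ms, AOI label).
fixations = [
    (0, 200, "lesion A"),
    (200, 500, "background"),
    (500, 1300, "lesion B"),
    (1300, 1700, "lesion B"),
    (1700, 1900, "lesion A"),
    (1900, 2500, "lesion D"),
]
reported = {"lesion D"}  # lesions mentioned in the verbal report (hypothetical)

labels = {}
for lesion in ["lesion A", "lesion B", "lesion C", "lesion D"]:
    dwell = cumulative_dwell_ms(fixations, lesion)
    labels[lesion] = classify_lesion(dwell, lesion in reported)
    print(lesion, dwell, labels[lesion])
```

Note that the labels for unreported lesions rest entirely on dwell time, which is exactly the inference the text above cautions against; the verbal report only settles the cases in which a lesion was actually mentioned.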

In fact, the first fundamental issue to consider is whether the concurrent think-aloud condition 
interferes with the primary task. To our knowledge, only a single study has been conducted on this issue in 
the context of medical image interpretation. Littlefair, Brennan, Reed, Williams, and Pietrzyk (2012) 
explored whether the think-aloud condition affects pulmonary nodule detection; they did this using a
within-subjects design with seven participants, two viewing sessions with a “wash-out period” separating them, and
a set of 30 two-dimensional radiographs. Half of the radiographs contained a single artificial nodule, and the 
rest were non-nodular. The participants were informed that the radiographs may contain a single nodule. No 
time limit was set for viewing. Performance was evaluated in terms of sensitivity, specificity, and ROC 
measures, including multicase multireader ROC AUC analysis. In addition, the participants’ eye movements 
were tracked to compare, for example, fixations of areas of interest and time to fixate on areas of interest. 
Results indicated that only half of the nodules ended up correctly localized, indicating an absence of ceiling
effects. There were no differences in performance under the two conditions, with the exception of confidence 
ratings (in the TA condition, the subjects were less confident) and task duration. The latter result was 
statistically significant. The results are well-aligned with Ericsson and Simon’s (1993) theoretical account. 
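
For readers less familiar with the performance measures mentioned above, the following sketch computes sensitivity, specificity, and an empirical ROC AUC for a single reader. It is a simplification under stated assumptions: it does not implement the multireader multicase analysis used in the study, and the confidence ratings, the decision cut-off, and the function names are hypothetical.

```python
def sensitivity_specificity(truth, decisions):
    """Return (sensitivity, specificity) from binary truth and binary decisions."""
    tp = sum(1 for t, d in zip(truth, decisions) if t and d)
    fn = sum(1 for t, d in zip(truth, decisions) if t and not d)
    tn = sum(1 for t, d in zip(truth, decisions) if not t and not d)
    fp = sum(1 for t, d in zip(truth, decisions) if not t and d)
    return tp / (tp + fn), tn / (tn + fp)


def auc_from_ratings(truth, ratings):
    """Empirical AUC: chance that a nodule case is rated above a normal case.

    Ties between a nodule case and a normal case count as one half.
    """
    pos = [r for t, r in zip(truth, ratings) if t]
    neg = [r for t, r in zip(truth, ratings) if not t]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = nodule present, 0 = nodule absent.
truth = [1, 1, 1, 0, 0, 0]
ratings = [5, 4, 2, 3, 1, 1]               # confidence ratings on a 1-5 scale
decisions = [int(r >= 3) for r in ratings]  # assumed cut-off for "nodule present"

sens, spec = sensitivity_specificity(truth, decisions)
auc = auc_from_ratings(truth, ratings)
print(round(sens, 3), round(spec, 3), round(auc, 3))
```

Comparing such measures between a think-aloud session and a silent session for the same reader is the kind of check that would reveal reactivity in terms of accuracy, as opposed to task duration.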

Concurrent think-aloud has also been used, rather surprisingly, in the context of dynamic stimuli. An 
alternative approach situated in the context of fish locomotion can be seen in Jarodzka, Scheiter, Gerjets, and 
van Gog (2010). Balslev et al. (2012) used think-aloud in the context of films depicting infants with
seizures and conditions resembling seizures. Their participants (medical students,
residents, and experts) thought aloud while diagnosing the infant seizures presented in the short films, which
lasted from 26 to 49 seconds. The films were looped and repeated until the observer wished to stop
viewing. Not surprisingly, the experts scored higher in diagnostic accuracy and spent relatively more time
viewing task-relevant features. A content analysis of the verbal accounts revealed that experts engaged
relatively more in exploring the material and spent more time building and evaluating hypotheses. This
pattern, in turn, explained why the experts returned to the areas of interest. This study showed how the
combined use of eye movements and verbal reports can lead to a better understanding of medical image 
interpretation. 

Finally, Li, Pelz, Alm, and Haake (2012) attempted to integrate eye movement information
completely with the concurrent verbalizations of a group of dermatologists who differed in their level of 
training; the groups were asked to observe 42 two-dimensional dermatological images. Subsequently, they 
developed a hierarchical probabilistic framework to extract unique and common eye movement patterns 
among multiple subjects within each expertise group. The idea was to map specific eye movement patterns 
to certain cognitions, such as identifying the primary morphology. Although the study is a remarkably 
ambitious endeavor to integrate eye tracking and concurrent verbalizations, as a process-tracing study, the 
work suffers from the implementation of the concurrent verbalizations. The novices were requested to 
provide a detailed description of the materials “as if describing to their doctors over the phone.” The medical 
professionals were instructed to examine and describe the findings to students “as if teaching.” (ibid., p. 395) 
These are clear violations of the principle of focusing on inner thought, and giving these instructions may have
seriously interfered with the primary task of image interpretation. Therefore, the mapping solution presented
may have limited value. 

Retrospective verbalization is a suitable option for very short tasks where there is simply not 
enough time to verbalize. Jaarsma, Jarodzka, Nap, van Merrienboer, and Boshuizen (2014) applied a heavily 
time-constrained research design. The authors presented two-dimensional microscopic images of colon tissue 
to a group of clinical pathologists, pathology residents, and medical students. The viewing of the images was 
constrained to 2 seconds. The participants’ eye movements were registered, along with their post hoc verbal 
accounts of what they had seen. (The authors did not use the expression “retrospective think-aloud”; instead, 
they referred to post hoc verbalizations.) The investigators analyzed the two sources of data separately. The 
verbal accounts were analyzed through an elaborate content analysis. The most interesting findings related to 
the differences between the clinical pathologists and the residents: in their search, the clinical pathologists 
tended to rely on what they had already seen, further studying the image for other abnormalities, whereas the 
residents tended to double-check their initial findings. In their post hoc verbalizations, the clinical 
pathologists focused on the typicality of the tissue, whereas the residents concentrated on naming 
pathologies. This study showed that important insights can be gleaned by combining eye movements and a 
form of retrospective verbal reports. 

Interestingly, we could not find a single study in which the authors reported collecting the data 
using Chi’s explanations as the method. Instead, the literature contains several studies in which 
the investigators collected the data using concurrent think-aloud and then referred to Chi (1997) in the 
analysis phase (Azevedo et al., 2007; van der Gijp et al., 2014; van der Gijp et al., 2015). It is also not unusual to 
find studies that set out to collect concurrent verbalizations but end up gathering explanations (e.g., Li et al., 2012; van 
der Gijp et al., 2015). 

5. Conclusions 

The integration of eye tracking and verbal reports is intuitively appealing, and as illustrated in this 
methodological article, interesting insights can be gleaned by adopting a mixed-methods approach. However, 
before collecting such data, important decisions must be made. The first critical decision is timing, that is, 
whether to collect concurrent or retrospective data. In the case of concurrent think-aloud, a decision must be 
made whether to follow Ericsson and Simon’s or Chi’s advice. In the case of retrospective reports, one must 
decide whether to play back the eye movements to the observer to aid the retrieval of information from 
long-term memory. Based on this methodological analysis, some methodological issues appear to be solved, 
whereas others require further investigation. 

We argue that two issues are solved: First, retrospective reporting without cuing is suitable for 
perceptual tasks of a very short duration (<10 seconds). The advantages are the following: (a) one can be 
certain that verbalization does not interfere with the primary task; (b) the observer’s verbalization is not 
constrained by the speed of the visual system; and (c) it is safe to assume that information is still available in 
short-term memory. Second, there exists a compelling body of literature indicating that for the purposes of 
process tracing, Ericsson’s nonreactive method is superior to the idea of soliciting direct explanations 
from the observers during task execution because soliciting explanations tends to interfere with the 
primary task. It is hard to see what the purpose of process tracing would be if the research method results in a 
substantial change in the primary activity. 

Other issues remain underexplored. First, when tasks are longer than 10 seconds, the pros and cons 
of Ericsson’s concurrent think-aloud versus cued retrospective reporting need to be weighed against each 
other. In the study by van Gog et al. (2005) in the context of troubleshooting electric circuits, the concurrent 
think-aloud condition produced more theoretical expressions, which were to some extent lost in the 
stimulated recall condition. We emphasize that this issue has not been explored in the context of medical 
image interpretation. More fundamentally, more studies are needed to establish whether 
perceptual tasks, such as viewing an X-ray, are affected by the think-aloud condition. The study by 
Russo et al. (1989) showed that even when experimenters stick meticulously to Ericsson and Simon’s 
instructions, the think-aloud condition may affect task performance. Russo et al. (1989, p. 758) concluded 
that “protocol validity should be based on an empirical check rather than theory-based assurances”. 
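
The guidance above can be condensed into a small decision helper. The following Python sketch is only an 
illustration of the argument in this section: the function name `recommend_verbal_report`, the 
`Recommendation` class, and its fields are hypothetical and do not come from any of the cited studies; the 
10-second threshold is the one stated above. 

```python
from dataclasses import dataclass

# Threshold taken from the text: "very short duration (<10 seconds)".
SHORT_TASK_LIMIT_S = 10.0


@dataclass(frozen=True)
class Recommendation:
    """Hypothetical container for a verbal-report recommendation."""
    method: str  # which verbal-report method(s) to consider
    status: str  # "settled" or "open", following the conclusions
    note: str    # short rationale paraphrased from the text


def recommend_verbal_report(task_duration_s: float) -> Recommendation:
    """Map the duration of a perceptual task to the guidance in the text."""
    if task_duration_s <= 0:
        raise ValueError("task_duration_s must be positive")

    if task_duration_s < SHORT_TASK_LIMIT_S:
        # Settled issue: uncued retrospective reporting for very short tasks.
        return Recommendation(
            method="retrospective report without cuing",
            status="settled",
            note="verbalizing cannot interfere with the primary task",
        )

    # Open issue: for longer tasks the two methods must be weighed
    # against each other, and reactivity should be checked empirically.
    return Recommendation(
        method="concurrent think-aloud or cued retrospective report",
        status="open",
        note="check empirically whether thinking aloud changes performance",
    )
```

For example, the 2-second viewing task of Jaarsma et al. (2014) falls on the settled side (uncued 
retrospective reporting), whereas a task lasting a minute returns the open case in which the two methods 
must be weighed against each other. Soliciting explanations during the task is deliberately absent from 
both branches, because it is argued above to interfere with the primary task. 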

We propose a research agenda with two goals: (a) to explore further whether think-aloud affects 
performance in a range of image interpretation tasks; and (b) to compare the type of information obtained by 
concurrent think-aloud and cued retrospective reporting with different types of material (two-dimensional, 
volumetric, video). It would also be useful to include observers with varying levels of experience. 


Keypoints 

• Before attempting to combine eye-tracking data and verbal accounts, important decisions must be 
made regarding the timing of the verbalizations and possible cuing. 

• Ericsson’s concurrent think-aloud is deemed superior to eliciting explanations from the observers 
during task performance. 

• Retrospective think-aloud is suitable for tasks of a very short duration (<10 seconds). 

• A research agenda is proposed for investigating the methodological issues that remain unsolved. 